Iterative Random Forests to detect predictive and stable high-order interactions
Abstract
Genomics has revolutionized biology, enabling the interrogation of whole transcriptomes, genome-wide binding sites for proteins, and many other molecular processes. However, individual genomic assays measure elements that operate in vivo as components of larger molecular machines that regulate gene expression. Understanding these processes and the high-order interactions that govern them presents a substantial statistical challenge. Building on Random Forests (RF) and Random Intersection Trees (RIT), and through extensive, biologically inspired simulations, we developed iterative Random Forests (iRF). iRF leverages the Principle of Stability to train an interpretable ensemble of decision trees and detect stable, high-order interactions with the same order of computational cost as RF. We demonstrate the utility of iRF for high-order interaction discovery in two prediction problems: enhancer activity in the early Drosophila embryo and alternative splicing of primary transcripts in human-derived cell lines. In Drosophila, iRF re-discovered the essential role of zelda (zld) in early zygotic enhancer activation and identified novel third-order interactions, e.g. between zld, giant (gt), and twist (twi). In human-derived cells, iRF re-discovered that H3K36me3 plays a central role in chromatin-mediated splicing regulation and identified novel 5th- and 6th-order interactions, indicative of multi-valent nucleosomes with specific roles in splicing regulation. By decoupling the order of interactions from the computational cost of identification, iRF opens new avenues of inquiry in genome biology, automating hypothesis generation for the discovery of new molecular mechanisms from high-throughput, genome-wide datasets.
Similar resources
Iterative random forests to discover predictive and stable high-order interactions
Genomics has revolutionized biology, enabling the interrogation of whole transcriptomes, genome-wide binding sites for proteins, and many other molecular processes. However, individual genomic assays measure elements that interact in vivo as components of larger molecular machines. Understanding how these high-order interactions drive gene expression presents a substantial statistical challenge...
Stability of variable importance scores and rankings using statistical learning tools on single-nucleotide polymorphisms and risk factors involved in gene × gene and gene × environment interactions
Risk of complex disorders is thought to be multifactorial, involving interactions between risk factors. However, many genetic studies assess association between disease status and markers one single-nucleotide polymorphism (SNP) at a time, due to the high-dimensional nature of the search space of all possible interactions. Three ensemble methods have been recently proposed for use in high-dimen...
Random forests algorithm in podiform chromite prospectivity mapping in Dolatabad area, SE Iran
The Dolatabad area, located in SE Iran, is a well-endowed terrain hosting several chromite mineralized zones. These chromite ore bodies are all hosted in a colored mélange complex zone comprising harzburgite, dunite, and pyroxenite. The deposits are irregular in shape and are distributed as small lenses along colored mélange zones. The area has a great potential for discovering further chromite...
Comparison of Stability Parameters for Detection of Stable and High Essential Oil Yielding Landraces of Rosa damascena Mill.
The essential oil yield stability of damask rose (Rosa damascena Mill.) as an important medicinal and aromatic plant in different environments has not been well documented. In order to determine appropriate stability parameters, six statistics were studied for essential oil stability of 35 Rosa damascena landraces in seven locations (Sanandaj, Arak, Kashan, Dezful, Stahban, Ke...
Analysis of the spatial pattern and interactions of Persian oak and wild pistachio in the Ghalajeh forests of Kermanshah using the K2 function
Quercus brantii Lindl. and Pistacia atlantica Desf. are the most important tree species in the Zagros forests. Heavy use of these trees by local inhabitants has reduced the quality and quantity of these valuable species and has created heterogeneous stands. Recognizing the spatial pattern and the interactions of trees can be a key to managerial interve...